if (!requireNamespace("GSA", quietly = TRUE)){
install.packages("GSA")}
if (!requireNamespace("kable", quietly = TRUE)){
install.packages("kable")}
My data set from GEO (Gene Expression Omnibus) is GSE131222 of Transcriptional control of sub-type switching ensures adaptation and growth of pancreatic cancer. My gene set aims to see the underlying ectopic expression of GLI2 expression in human pancreatic cancer cell lines specifically looking at YAPC cells. Moreover, 1 ug/ml of Dox treatment allowed for over expression analysis of GLI2 by measuring via RNA-seq. on Illumina hiseq 2000. The authors used TopHat and Cufflinks for quality control and later used qRT-PCR along with SYBR Green assays to measure the intensity. By concluding oncogene genes and detecting induced-higher expression values compare to expression values 6 days before, GLI2 regulation of genes associated with basal-like sub-type switching was detected
Firstly, the data set I obtained from GEO was cleaned and filtered based on duplication. The normalized values of genes that are mostly miRNA genes are eliminated from the gene set. There have been instances that one row of expression has multiple genes associated. This was later found out on the second assignment where we believe this did not cause any problem, due to our limited analysis in the first assignment. Later I analyzed the dispersion among treatments of both triplicate controls that contain empty plasmid of GLI2 and experiments that contain inducible GLI2 plasmid! Since there might be a discrepancy among the gene annotation, which can cause a misunderstanding of our later analysis, I validated my HUGO gene symbols. Furthermore, by identifying HUGO symbols that are respective or similar to those that our clean data set contains. I was able to conclude that my data set is clean, contains no duplicates as well as ready for the later analyses which are normalization, differential expression calculation, non-thresholded & thresholded gene enrichment analyses, fold expression based on the fitted model of GLI2 expression set, and finally for the enrichment mapping with post analysis.
Secondly, by preliminary analyzing the differential expression of our data set via heat map I have concluded that there is a strong difference between the treatment of GLI2 and the control cells in terms of their summarized gene expression by the model. Later, by visualizing the p-value distribution I concluded most of the oncogenes referred on the main article are clustered at the specified p-value interval. Later sample clustering summarized the gene expression difference between the treatment and the control genes. Stringently analyzing Lastly, the non-thresholded gene enrichment map both allowed me to observe genes that are passing our permissive FDR values as well as defined the pathways that are found to be associated with oncogenes. These pathways are then altered visualized and used to conclude that by doing all the analysis on assignment 2, I was able to differentiate the genes that are related to sub-type switching (EMT) and those that are not (non-EMT). Finally, I have separated two data from each other and obtained a ranked gene list based additionally. These analyses taught me how to select a data set, clean a data set, normalize it, analyze to what extent the genes are expressed differentially from one another, and see how the genes are interacting genes in a given pathway or a given gene set. Also, finally, validate the author’s conclusion or come up with my conclusion and promote further research in the area. In this final assignment of mine, I will be concluding my analyses with enrichment mapping by using Cytoscape and its apps as well as conservatively gene set.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/GSEA.png")
Figure 1 GSEA running on Pre-ranked Gene List Panel: I used no collapse to remap gene symbols to avoid confusion. 200 max gene size and 15 min gene size to have more stringent values. I used 1000 permutations that allowed for a shorter running time.
I have used GSEA (4.3.2) (Merico et al, 2010) non-thresholded gene set enrichment method. I have used the Bader Lab gene set of Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt including no HUGO symbols and IEA symbols published on March 1, 2021. It was released in 2023 by Human MiSigDB. Therefore, GSEA resulted in more stringent non-EMT values that are shown to be enriched. There are pathways of genes that are known to be involved in apoptosis. This makes sense because, with my previous results that my log CPM is always negative, there are no positive log CPM values. This was my previous hypothesis that almost every gene treated showed higher expression with the treatments the authors did to induce oncogenic roles.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/enrichment.png")
Figure 2 GSEA Enrichment Results: GSE results indicate there are 4703 upregulated gene sets and 38 down regulated gene sets.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/neg.png")
Figure 3 Negative Enrichment Results: The top gene set for negative is PHOSPHOLIPASES%HUMANCYC%LIPASYN-PWY with 0 FDR and 0 p-value and-0.65 ES.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/pos.png")
Figure 4 Positive Enrichment Results: The top gene set at the positive gene set is TRANSCRIPTIONAL REGULATION BY SMALL RNAS%REACTOME%R-HSA-5578749.3 with 0.784 FDR-value and 0.001 Nominal p-values and 0.98 ES score.
There are pathways of genes that are known to be involved in apoptosis. This makes sense because, with my previous results that my log CPM is always negative, there are no positive log CPM values. This was my previous hypothesis that almost every gene treated showed higher expression with the treatments the authors did to induce oncogenic roles.
Most of the pathways are viral infection pathways. Furthermore, some viruses stimulate pancreas cancer and liver cancer as well such as chronic Hepatitis C and chronic Hepatitis B (Kamiza et al, 2022) A significant proportion of these pathways are systematic disease-causing viruses such as Epstein Barr virus and HIV-1 virus. The negative results contain more oncogenic pathways such that the apoptosis and sialic acid synthesis gene sets are more prevalent as opposed to positive. There is the separation of oncogenic pathways between positive and negative. The second assignment gene enrichment results were more clear in terms of gene pathway role. The difference between EMT genes vs non-EMT genes is more apparent in the thresholded analysis. the gene enrichment results in the non-thresholded analysis are more scarce and do not give significantly enriched results. Even though my parameters are stringent, I am not getting pathways that are adequate in terms of their role.
Finally, the quantitative difference between the thresholded and the non-thresholded is the the thresholded analysis has 771 downregulated genes sets that are less than 0.001 p value and there are 1128 upregulated gene sets. It is not a straight forward comparison because the difference between non-threholded and thresholded is high as 4703 upregulated genes (non-thresholded) - 1128 upregulated genes (thresholded) = 3575 gene sets.Also, the downregulated gene set difference is more substantial (733) gene sets more in thresholded analysis.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/pathway.png")
Figure 5 Preliminary Cytoscape Pathway Enrichment Visualization: Both apoptotic and viral gene set are separated from each other. Mesenchymal apoptotic gene sets are apparent. Both upregulated (Red) and downregulated (Blue) gene sets are observed. This includes the whole enrichment map.
The enrichment map Cytoscape visualization was done by using stringent 0.05 p-value and ver permissive 0.5 FDR values (Kamiza et al. 2022) The reason why I used 0.5 for FDR is that. Furthermore, Jaccard + Overlap coefficient 0.375 and combined coefficient 0.5 were used. In total there are 291 nodes (including hidden) and 763 edges (including hidden edges) in the visualization. There are more gene sets that I find to be associated with each other.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/publication.png")
Figure 6 Publication Ready Annotation Figure of Gene Enrichment visualization (Shannon et al, 2003): By annotating and analyzing in terms of positivity and negativity the enrichment visualization showed up regulated gene sets to be collapsed in viral export RNA theme. Most of the down regulated are not associated with one another.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/parameters.png")
Figure 7 Annotation parameters: The Default parameters are used due to small visualization and for proper observation. The annotations fit the model as there is strong evidence of the previous connections among cancer and viral genes indicated.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/cluster.png")
Figure 8 Publication ready figure of Gene Enrichment visualization: Collapsed themes indicate that there is indeed relationship between abortive viral RNA gene sets and the export viral RNA gene set.
Apoptotic gene sets are related to those of viral infection gene sets. This supports the presence of viral effects on pancreas cancer as well as many other cancer types. I created the publication-ready figure by using default parameters of MCL cluster 3 max word per label, minimum word occurrence of 1, and adjacent word bonus of 8 with WordCloud label algorithm. By collapsing the enrichment map I found 24 major themes on my map. I do not observe any novel pathways that appeared after total collapsing.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/sapiens.png")
Figure 9 (Extra!) GeneMania gene enrichment pathway results: Most of the genes are virus related (Zuberi et al, 2013) The complete gene set GeneMania analysis to see if the genes in the respective pathways on the enrichment map are associated. I noted viral genes mostly. The yellow node is the cancer gene assocaited with viral genes (black)
Genemania <- read.csv(file=file.path(getwd(),"data","table_genemania.csv"))
knitr::kable(Genemania, type="html")
| EnrichmentMap..Expression | gene.name | log.score | name | node.type | query.term | score | shared.name |
|---|---|---|---|---|---|---|---|
| 0.0000 | NUP50 | -0.4593206 | H__sapiens__1_-774733 | query | NUP50 | 0.6317127 | H__sapiens__1_-774733 |
| 0.0000 | NUP205 | -0.3864459 | H__sapiens__1_-782635 | query | NUP205 | 0.6794675 | H__sapiens__1_-782635 |
| 0.0000 | RANBP2 | -0.6944706 | H__sapiens__1_-782386 | query | RANBP2 | 0.4993387 | H__sapiens__1_-782386 |
| 0.0000 | TPR | -0.4513919 | H__sapiens__1_-773361 | query | TPR | 0.6367412 | H__sapiens__1_-773361 |
| 0.0000 | POM121 | -0.3202072 | H__sapiens__1_-789815 | query | POM121 | 0.7259986 | H__sapiens__1_-789815 |
| 0.0000 | NUP88 | -0.4133439 | H__sapiens__1_-776294 | query | NUP88 | 0.6614348 | H__sapiens__1_-776294 |
| 0.0000 | NUP188 | -0.4694928 | H__sapiens__1_-774764 | query | NUP188 | 0.6253193 | H__sapiens__1_-774764 |
| 0.0000 | NUP133 | -0.4971920 | H__sapiens__1_-773810 | query | NUP133 | 0.6082362 | H__sapiens__1_-773810 |
| 0.0000 | RAE1 | -0.5025908 | H__sapiens__1_-775191 | query | RAE1 | 0.6049613 | H__sapiens__1_-775191 |
| 35.8284 | NUP160 | -0.4609759 | H__sapiens__1_-773223 | query | NUP160 | 0.6306679 | H__sapiens__1_-773223 |
| 0.0000 | NUP155 | -0.4669676 | H__sapiens__1_-776946 | query | NUP155 | 0.6269004 | H__sapiens__1_-776946 |
| 0.0000 | NUP54 | -0.4274837 | H__sapiens__1_-780479 | query | NUP54 | 0.6521480 | H__sapiens__1_-780479 |
| 0.0000 | NUP153 | -0.5116202 | H__sapiens__1_-778335 | query | NUP153 | 0.5995235 | H__sapiens__1_-778335 |
| 0.0000 | NUP214 | -0.5148612 | H__sapiens__1_-778618 | query | NUP214 | 0.5975835 | H__sapiens__1_-778618 |
| 0.0000 | SEC13 | -0.5645876 | H__sapiens__1_-782804 | query | SEC13 | 0.5685946 | H__sapiens__1_-782804 |
| 0.0000 | NDC1 | -0.1134623 | H__sapiens__1_-773535 | query | NDC1 | 0.8927378 | H__sapiens__1_-773535 |
| 0.0000 | NUP37 | -0.4418337 | H__sapiens__1_-774031 | query | NUP37 | 0.6428565 | H__sapiens__1_-774031 |
| 0.0000 | AAAS | -0.3917891 | H__sapiens__1_-774750 | query | AAAS | 0.6758466 | H__sapiens__1_-774750 |
| 0.0000 | NUP85 | -0.4620475 | H__sapiens__1_-778398 | query | NUP85 | 0.6299924 | H__sapiens__1_-778398 |
| 0.0000 | NUP43 | -0.4713705 | H__sapiens__1_-777773 | query | NUP43 | 0.6241463 | H__sapiens__1_-777773 |
| 0.0000 | NUP107 | -0.5350951 | H__sapiens__1_-776652 | query | NUP107 | 0.5856136 | H__sapiens__1_-776652 |
| 0.0000 | NUP93 | -0.4014051 | H__sapiens__1_-775466 | query | NUP93 | 0.6693788 | H__sapiens__1_-775466 |
| 0.0000 | NUP210 | -0.3976326 | H__sapiens__1_-779292 | query | NUP210 | 0.6719089 | H__sapiens__1_-779292 |
| 0.0000 | NUP35 | -0.3966731 | H__sapiens__1_-783644 | query | NUP35 | 0.6725539 | H__sapiens__1_-783644 |
| 0.0000 | NUP62 | -0.4904854 | H__sapiens__1_-795021 | query | NUP62 | 0.6123291 | H__sapiens__1_-795021 |
| 0.0000 | POM121C | -0.0841121 | H__sapiens__1_-826631 | query | POM121C | 0.9193282 | H__sapiens__1_-826631 |
library(GSA)
gmt_file <- file.path(getwd(),"data",
"Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt")
capture.output(genesets <-
GSA.read.gmt(gmt_file),file="GSA_file.txt")
names(genesets$genesets) <- genesets$geneset.names
knitr::kable(head(genesets$geneset.names), type="html")
| x |
|---|
| PROTEIN CITRULLINATION%HUMANCYC%PWY-4921 |
| METHYLGLYOXAL DEGRADATION III%HUMANCYC%PWY-5453 |
| STEARATE BIOSYNTHESIS I (ANIMALS)%HUMANCYC%PWY-5972 |
| TRYPTOPHAN DEGRADATION X (MAMMALIAN, VIA TRYPTAMINE)%HUMANCYC%PWY-6307 |
| CERAMIDE BIOSYNTHESIS%HUMANCYC%PWY3DJ-12 |
| L-SERINE DEGRADATION%HUMANCYC%SERDEG-PWY |
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/p53.png")
Figure 10 Signature gene set p53 against all gene sets enriched: Post analysis coefficient of 0.25 Mann-Whitney cut-off (Two-sided) was used. Post-analysis results also show that p63 is mainly involved in reactive oxygen species (ROS) metabolism. ROS are known to be tumor growth promoters which then cause EMT in different human tissues as well as trigger cancer of different type. Portion of the previous figure was cropped, due to complications in the figure. The strong interaction with edges is apparent s
The post analysis includes the signature gene set from the Bader
lab gene sets (Kucera et al, 2016) of Human symbols of DrugBank approved
gene set. p53 is an oncogene and it is involved in many pathways that
are known to be causing different types of cancers. P53 is associated
with viral pathways as well as some oncogenic pathways. Therefore, it is
safe to assume that the viral effect on oncogenic genes for triggering
tumors is indeed true.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/NANS.png")
Figure 11 NANS (Extra!) Sialic Acid pathway of Homo sapiens(Munkley et al. ,2022): Moreover, cancer-triggered apoptosis is one of the results of sialic acid. Neu5Ac 9-phosphate synthase (NANS)( encircled red) is an enzyme in sialic acid metabolism.
The sialic acid pathway is one of the pathways that is known to be affecting tumor proliferation, cell growth, cell detachment and as a result metastasis. It is known to be highly expressed by the cancer cell. NANS’ role in sialic acid pathways on the figure. Also, glycolipid-bound sialic acids are highly found in cancer cells. Finally, by having high expression levels and visualized on the pathway analysis NANS has the potential to be an oncogene. The role of sialic acid differs as there is an interaction between cancer genes and the viral genes, I do hypothesize that cancer causing viruses are an important factor in high sialic acid levels in pancreas cancer.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/sialicacid.jpg")
Figure 12 Oncogenic rosle of Sialic acid Metabolism (Zhou et al, 2020): The figure capture hallmark of a bunch of other pathways containing sialic acid.
1.Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment 2 thresholded methods ?
Yes, the enrichment results support the conclusion. It not only supports the conclusion of EMT genes that are known to be unregulated and found on the positive side of the gene enrichment analysis but also a finding of viral effect. This viral effect on different types of cancer has been discussed in different literature. However, the cancer oncogene p53 has a relationship with oncogenic viral cancer-inducing genes. Further research is required to illustrate both pancreas cancer and much other cancer. The results that we obtained from the thresholded analysis are partly different from what we obtained from the non-thresholded analysis. I am confident that this is due to the gene-set difference as well as experimental errors in data analysis. However, we were lucky to get partial positive results and conclude similarly to the main paper’s conclusion.
2.Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result ?
(Kamiza et al, 2022) discusses p53 manipulation by oncogenic viruses and shows a clear picture of possible non-oncogenic viruses. my analysis results showed clear indications of HIV, E1b(Ebstein Barr virus) effect on cell apoptosis. Also, cellular degradation pathways are partially visualized on my enrichment map. They are also associated with viral metabolic activity such as HIV (shown on the degradation side)(Munkley et al, 2019) Viral activity is then associated with sialic acid as this paper discusses and mentions the virus effect on cancer. In conclusion my hypothesis is strengthened by the findings that the authors findings. This paves a path for further research in finding relationship of viral genes specific to the pancreas or liver and the
In the third part of the assignment the post-analysis of the GSEA results allowed to see the direct interaction between different viral cancerous genes. This paves a path for further research in viral effect on the pancreas cancer, by strengthening the conclusion that the authors made. The results from g:Profile and the GSEA correlate in a way that both of these analysis do have similar numbers of genes lower than 1% p-value. This makes sense because both of these analysis indicate the conclusion of hidden viral genes that the author did not show in their analysis.
knitr::include_graphics("/Users/mac/Desktop/BCB420/A3_MetyuMelkonyan/images/cancer.jpeg")
Figure 13: Oncogenic role of Sialic acid Metabolism(Munkley et al, 2022) : Hallmark of apoptosis, cell cycle and transcription and hallmarks of p53 stability indicate proposed mechanisms and the prospective viruses having genes assocaited with the mechanisms